The tasks included in this report are:

unique(measure_labels$task_group)
 [1] "adaptive_n_back"             "attention_network_task"     
 [3] "choice_reaction_time"        "directed_forgetting"        
 [5] "dot_pattern_expectancy"      "local_global_letter"        
 [7] "motor_selective_stop_signal" "recent_probes"              
 [9] "shape_matching"              "simon"                      
[11] "stim_selective_stop_signal"  "stop_signal"                
[13] "stroop"                      "threebytwo"                 

T1 HDDM parameters

Before proceeding with the rest of the report, we first determine whether there are significant differences in the distributions of the HDDM parameter estimates and their reliabilities depending on whether they are fit on the full sample (n=552) or the retest sample (n=150) for t1 data.

Parameter stability

Differences in distributions: using scaled differences

Does the distribution of scaled difference scores (between estimates using n=150 vs. n=552) have a mean different from 0, allowing for a random effect of parameter to account for the different parameter types? No.

if(!exists('hddm_pars')){
  source('/Users/zeynepenkavi/Dropbox/PoldrackLab/SRO_DDM_Analyses/code/workspace_scripts/hddm_pars_data.R')
}
summary(lmer(scaled_diff ~ rt_acc + (1|dv), hddm_pars))
Linear mixed model fit by REML ['lmerMod']
Formula: scaled_diff ~ rt_acc + (1 | dv)
   Data: hddm_pars

REML criterion at convergence: 41695

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-10.438  -0.542  -0.004   0.518   9.750 

Random effects:
 Groups   Name        Variance Std.Dev.
 dv       (Intercept) 3.24e-33 5.69e-17
 Residual             9.93e-01 9.97e-01
Number of obs: 14722, groups:  dv, 103

Fixed effects:
                    Estimate Std. Error t value
(Intercept)         6.82e-18   9.82e-03       0
rt_accnon-decision -2.70e-17   2.44e-02       0
rt_accthreshold     9.98e-18   2.25e-02       0

Correlation of Fixed Effects:
            (Intr) rt_cc-
rt_ccnn-dcs -0.403       
rt_ccthrshl -0.436  0.176
# Same result
# summary(MCMCglmm(scaled_diff ~ rt_acc, random = ~ dv, data=hddm_pars))

Parameter reliability

Do HDDM parameters differ in their reliability depending on whether they are derived from n=150 or n=552? No.

if(!exists('hddm_rels')){
  source('/Users/zeynepenkavi/Dropbox/PoldrackLab/SRO_DDM_Analyses/code/workspace_scripts/hddm_rel_data.R')
}

hddm_rels = hddm_rels %>%
  gather(sample, value, -dv, -rt_acc)

summary(lmer(value ~ sample*rt_acc + (1|dv), hddm_rels))
Linear mixed model fit by REML ['lmerMod']
Formula: value ~ sample * rt_acc + (1 | dv)
   Data: hddm_rels

REML criterion at convergence: -1207

Scaled residuals: 
   Min     1Q Median     3Q    Max 
-3.492 -0.260  0.013  0.254  3.510 

Random effects:
 Groups   Name        Variance Std.Dev.
 dv       (Intercept) 0.029746 0.172   
 Residual             0.000196 0.014   
Number of obs: 440, groups:  dv, 220

Fixed effects:
                               Estimate Std. Error t value
(Intercept)                     0.50848    0.01642   30.96
samplerefit                    -0.00331    0.00188   -1.76
rt_accnon-decision              0.03807    0.02889    1.32
rt_accthreshold                -0.02978    0.02836   -1.05
samplerefit:rt_accnon-decision -0.00394    0.00331   -1.19
samplerefit:rt_accthreshold     0.00497    0.00325    1.53

Correlation of Fixed Effects:
            (Intr) smplrf rt_cc- rt_cct smp:_-
samplerefit -0.057                            
rt_ccnn-dcs -0.568  0.033                     
rt_ccthrshl -0.579  0.033  0.329              
smplrft:r_-  0.033 -0.568 -0.057 -0.019       
smplrft:rt_  0.033 -0.579 -0.019 -0.057  0.329

Parameter fit

Do HDDM parameter estimates fit participant data better if the model prior is informed by a larger sample? To assess model fit we sampled from the posterior predictive for each subject and regressed the predicted RTs on the actual RTs (using both raw and log-transformed RTs). Our measure of fit is the adjusted \(R^2\) from these regressions for each participant.

Here we compare fits of the same model with priors informed by n=552 vs. n=150.

Note: In calculating the fit statistics for the full sample (n=552) hierarchical models we used fewer samples from the posterior predictive due to computational constraints. We did, however, confirm that the number of samples from the posterior predictive did not have an effect on the fit statistics. Details of those analyses can be found here.
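
The fit statistics themselves are computed upstream; as a minimal sketch of the per-subject regression described above (the function and column names `obs_rt`/`pred_rt` are hypothetical, not the actual script's):

```r
# Sketch of the fit statistic: regress RTs simulated from the posterior
# predictive on a subject's observed RTs and keep the adjusted R^2.
get_fit_stat <- function(obs_rt, pred_rt, log_transform = TRUE) {
  if (log_transform) {
    obs_rt <- log(obs_rt)
    pred_rt <- log(pred_rt)
  }
  summary(lm(pred_rt ~ obs_rt))$adj.r.squared
}

# e.g., one fit statistic per subject x task:
# fitstats <- ppc_df %>%
#   group_by(subj_id, task_name) %>%
#   summarise(log_rsq_adj = get_fit_stat(obs_rt, pred_rt))
```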

refit_fitstats %>%
  left_join(t1_hierarchical_fitstats, by = c("subj_id", "task_name")) %>%
  ggplot(aes(log_rsq_adj.x, log_rsq_adj.y))+
  geom_point(aes(color=task_name))+
  xlab("Adjusted r-sq for n=150")+
  ylab("Adjusted r-sq for n=552")+
  theme(legend.position = "none")+
  geom_abline(slope = 1, intercept=0)
Warning: Column `subj_id` joining factors with different levels, coercing
to character vector
Warning: Removed 106 rows containing missing values (geom_point).

The fit statistics (looking here at the adjusted \(R^2\)s using log-transformed RTs) for the same models using n=552 and n=150 for the priors are highly correlated.

tmp = refit_fitstats %>%
  left_join(t1_hierarchical_fitstats, by = c("subj_id", "task_name"))
Warning: Column `subj_id` joining factors with different levels, coercing
to character vector
with(tmp, cor.test(log_rsq_adj.x, log_rsq_adj.y))

    Pearson's product-moment correlation

data:  log_rsq_adj.x and log_rsq_adj.y
t = 120, df = 1900, p-value <2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.9317 0.9425
sample estimates:
   cor 
0.9373 

And they do not systematically differ from each other.

with(tmp, t.test(log_rsq_adj.x, log_rsq_adj.y, paired=T))

    Paired t-test

data:  log_rsq_adj.x and log_rsq_adj.y
t = -1.1, df = 1900, p-value = 0.3
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.0006482  0.0001833
sample estimates:
mean of the differences 
             -0.0002325 

Conclusion: For t1 data, parameter estimates refit using only the retest sample do not differ in any detectable way from estimates using the full t1 sample. They are also in principle more comparable to the parameter estimates from t2 and are therefore used in the rest of this report.

DDM vs raw reliability overall

Plot reliability point estimates comparing DDM measures to raw measures, faceted by contrast vs. non-contrast measures.
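
The figure is pre-generated and loaded from disk; a sketch of how a comparable plot could be built, assuming `rel_df` carries the `icc`, `ddm_raw`, `overall_difference`, and `rt_acc` columns used in the models below:

```r
library(dplyr)
library(ggplot2)

# Sketch: reliability point estimates by measure type (raw vs. ddm),
# faceted by contrast vs. non-contrast; column names assumed from the
# models fit later in this section.
rel_df %>%
  filter(rt_acc != "other") %>%
  ggplot(aes(ddm_raw, icc)) +
  geom_boxplot() +
  facet_wrap(~overall_difference) +
  xlab("Measure type") +
  ylab("Reliability (ICC)")
```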

fig_name = 'ddmvsraw_point.jpeg'

knitr::include_graphics(paste0(fig_path, fig_name))

Plot averaged bootstrapped reliability estimates per measure comparing DDM measures to raw measures, faceted by contrast vs. non-contrast measures.

fig_name = 'ddmvsraw_boot.jpeg'

knitr::include_graphics(paste0(fig_path, fig_name))

Model testing whether the reliability of raw measures differs from that of ddm estimates and whether contrast measures differ from non-contrast measures.

Checking whether both fixed effects (raw vs. ddm and contrast vs. non-contrast) as well as their interaction are necessary.

Conclusion: The interactive model is best.

mer1 = lmer(icc ~ ddm_raw + (1|dv), boot_df %>% filter(rt_acc != "other"))
mer1a = lmer(icc ~ overall_difference + (1|dv), boot_df %>% filter(rt_acc != "other"))
mer2 = lmer(icc ~ ddm_raw + overall_difference + (1|dv), boot_df %>% filter(rt_acc != "other"))
mer3 = lmer(icc ~ ddm_raw * overall_difference + (1|dv), boot_df %>% filter(rt_acc != "other"))
anova(mer1, mer2, mer3)
refitting model(s) with ML (instead of REML)
anova(mer1a, mer2, mer3)
refitting model(s) with ML (instead of REML)
rm(mer1, mer2, mer1a)

Raw measures do not significantly differ from ddm parameters in their reliability, but non-contrast measures are significantly more reliable than contrast and condition measures.

summary(mer3)
Linear mixed model fit by REML ['lmerMod']
Formula: icc ~ ddm_raw * overall_difference + (1 | dv)
   Data: boot_df %>% filter(rt_acc != "other")

REML criterion at convergence: -654256

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-130.34   -0.36    0.03    0.40   17.35 

Random effects:
 Groups   Name        Variance Std.Dev.
 dv       (Intercept) 0.0518   0.228   
 Residual             0.0162   0.127   
Number of obs: 512000, groups:  dv, 512

Fixed effects:
                                       Estimate Std. Error t value
(Intercept)                              0.5947     0.0245   24.23
ddm_rawraw                              -0.0329     0.0395   -0.83
overall_differencecontrast              -0.3660     0.0366   -9.99
overall_differencecondition             -0.0865     0.0304   -2.84
ddm_rawraw:overall_differencecontrast    0.0790     0.0597    1.32
ddm_rawraw:overall_differencecondition  -0.0743     0.0490   -1.52

Correlation of Fixed Effects:
                        (Intr) ddm_rw ovrll_dffrnccnt ovrll_dffrnccnd
ddm_rawraw              -0.621                                       
ovrll_dffrnccnt         -0.670  0.416                                
ovrll_dffrnccnd         -0.806  0.501  0.540                         
ddm_rwrw:vrll_dffrnccnt  0.411 -0.662 -0.614          -0.331         
ddm_rwrw:vrll_dffrnccnd  0.501 -0.807 -0.336          -0.621         
                        ddm_rwrw:vrll_dffrnccnt
ddm_rawraw                                     
ovrll_dffrnccnt                                
ovrll_dffrnccnd                                
ddm_rwrw:vrll_dffrnccnt                        
ddm_rwrw:vrll_dffrnccnd  0.534                 

Best measure for each task

For each task with both raw and DDM parameters, which measure is the best individual difference measure?

Even though the ddm parameters are not significantly less reliable overall, the most reliable measure is more frequently a raw measure. For some tasks an EZ estimate is best. Regardless of raw vs. ddm, the best measure is always a non-contrast measure.

rel_df %>%
  group_by(task_group) %>%
  filter(icc == max(icc)) %>%
  select(task_group, everything())

Variance breakdown measure types

fig_name = 'ddmvsraw_varsubs.jpeg'

knitr::include_graphics(paste0(fig_path, fig_name))

fig_name = 'ddmvsraw_varresid.jpeg'

knitr::include_graphics(paste0(fig_path, fig_name))

Model testing whether the percentage of between subjects variance of raw measures differs from that of ddm estimates and whether contrast measures differ from non-contrast measures.

Checking whether both fixed effects (raw vs. ddm and contrast vs. non-contrast) as well as their interaction are necessary.

Conclusion: The model with both fixed effects (no interaction) is best.

mer1 = lmer(var_subs_pct ~ ddm_raw + (1|dv), boot_df %>% filter(rt_acc != "other"))
mer1a = lmer(var_subs_pct ~ overall_difference + (1|dv), boot_df %>% filter(rt_acc != "other"))
mer2 = lmer(var_subs_pct ~ ddm_raw +  overall_difference + (1|dv), boot_df %>% filter(rt_acc != "other"))
mer3 = lmer(var_subs_pct ~ ddm_raw *  overall_difference + (1|dv), boot_df %>% filter(rt_acc != "other"))
anova(mer1, mer2, mer3)
refitting model(s) with ML (instead of REML)
anova(mer1a, mer2, mer3)
refitting model(s) with ML (instead of REML)
rm(mer1, mer1a, mer3)

Contrast measures have lower between subjects variability (i.e., are worse individual difference measures). Raw and ddm measures do not differ significantly.

summary(mer2)
Linear mixed model fit by REML ['lmerMod']
Formula: var_subs_pct ~ ddm_raw + overall_difference + (1 | dv)
   Data: boot_df %>% filter(rt_acc != "other")

REML criterion at convergence: 3845550

Scaled residuals: 
   Min     1Q Median     3Q    Max 
-4.485 -0.657  0.065  0.669  4.841 

Random effects:
 Groups   Name        Variance Std.Dev.
 dv       (Intercept) 197      14.0    
 Residual             106      10.3    
Number of obs: 512000, groups:  dv, 512

Fixed effects:
                            Estimate Std. Error t value
(Intercept)                    50.59       1.29    39.4
ddm_rawraw                      2.30       1.28     1.8
overall_differencecontrast     -8.02       1.78    -4.5
overall_differencecondition    -2.57       1.47    -1.7

Correlation of Fixed Effects:
                (Intr) ddm_rw ovrll_dffrnccnt
ddm_rawraw      -0.383                       
ovrll_dffrnccnt -0.619  0.012                
ovrll_dffrnccnd -0.745 -0.001  0.536         

Model testing whether the percentage of residual variance of raw measures differs from that of ddm estimates and whether contrast measures differ from non-contrast measures.

Checking whether both fixed effects (raw vs. ddm and contrast vs. non-contrast) as well as their interaction are necessary.

Conclusion: The interactive model is best.

mer1 = lmer(var_resid_pct ~ ddm_raw + (1|dv), boot_df %>% filter(rt_acc != "other"))
mer1a = lmer(var_resid_pct ~ overall_difference + (1|dv), boot_df %>% filter(rt_acc != "other"))
mer2 = lmer(var_resid_pct ~ ddm_raw + overall_difference + (1|dv), boot_df %>% filter(rt_acc != "other"))
mer3 = lmer(var_resid_pct ~ ddm_raw * overall_difference + (1|dv), boot_df %>% filter(rt_acc != "other"))
anova(mer1, mer2, mer3)
refitting model(s) with ML (instead of REML)
anova(mer1a, mer2, mer3)
refitting model(s) with ML (instead of REML)
rm(mer1, mer1a, mer2)

Both contrast and condition measures have higher residual variance than non-contrast measures. Raw and ddm measures do not differ.

summary(mer3)
Linear mixed model fit by REML ['lmerMod']
Formula: var_resid_pct ~ ddm_raw * overall_difference + (1 | dv)
   Data: boot_df %>% filter(rt_acc != "other")

REML criterion at convergence: 3308034

Scaled residuals: 
   Min     1Q Median     3Q    Max 
-6.179 -0.562  0.016  0.604  7.682 

Random effects:
 Groups   Name        Variance Std.Dev.
 dv       (Intercept) 64.7     8.05    
 Residual             37.2     6.10    
Number of obs: 512000, groups:  dv, 512

Fixed effects:
                                       Estimate Std. Error t value
(Intercept)                              19.460      0.868   22.42
ddm_rawraw                                1.279      1.397    0.92
overall_differencecontrast               12.009      1.295    9.27
overall_differencecondition               2.783      1.076    2.59
ddm_rawraw:overall_differencecontrast    -1.387      2.111   -0.66
ddm_rawraw:overall_differencecondition    2.968      1.732    1.71

Correlation of Fixed Effects:
                        (Intr) ddm_rw ovrll_dffrnccnt ovrll_dffrnccnd
ddm_rawraw              -0.621                                       
ovrll_dffrnccnt         -0.670  0.416                                
ovrll_dffrnccnd         -0.806  0.501  0.540                         
ddm_rwrw:vrll_dffrnccnt  0.411 -0.662 -0.614          -0.331         
ddm_rwrw:vrll_dffrnccnd  0.501 -0.807 -0.336          -0.621         
                        ddm_rwrw:vrll_dffrnccnt
ddm_rawraw                                     
ovrll_dffrnccnt                                
ovrll_dffrnccnd                                
ddm_rwrw:vrll_dffrnccnt                        
ddm_rwrw:vrll_dffrnccnd  0.534                 
rm(mer3)

Sample size effects on reliability

Differences in DDM parameter reliability for t1 data using either n=552 or n=150 were reported above under T1 HDDM parameters. No meaningful differences exist between these two sample sizes.

But even 150 is a large sample for psychological studies, especially for the forced choice reaction time tasks included in this report. Here we look at how the reliability of raw and ddm measures changes for sample sizes more common in studies using these tasks (25, 50, 75, 100, 125, 150).

Note: We are not refitting HDDMs for each of these sample sizes since (a) there were no differences in parameter stability for n=150 vs. n=552 and (b) a more comprehensive comparison using non-hierarchical estimates and model fit indices will follow. [Should I revisit this? 150 and 552 might be too large to lead to changes in parameter estimates, but the smaller samples that are more common in psych studies might sway estimates more.]

Note: Some variables do not have enough variance to calculate reliability for the different sample sizes. These variables are:
> stroop.post_error_slowing
> simon.std_rt_error
> shape_matching.post_error_slowing
> directed_forgetting.post_error_slowing
> choice_reaction_time.post_error_slowing
> choice_reaction_time.std_rt_error
> dot_pattern_expectancy.post_error_slowing
> motor_selective_stop_signal.go_rt_std_error
> motor_selective_stop_signal.go_rt_error
> attention_network_task.post_error_slowing
> recent_probes.post_error_slowing
> simon.post_error_slowing
> dot_pattern_expectancy.BY_errors
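
The sourced script does the actual computation; as a sketch of the subsampling scheme, using a simplified two-component version of the variance decomposition reported above (function and column names are hypothetical):

```r
library(dplyr)
library(lme4)

# ICC from a random-intercept model: between-subjects variance over
# total variance. `long_df` is assumed to have one row per subject x
# time point with columns sub_id, time, and value.
calc_icc <- function(long_df) {
  m <- lmer(value ~ 1 + (1 | sub_id), data = long_df)
  vc <- as.data.frame(VarCorr(m))
  vc$vcov[vc$grp == "sub_id"] / sum(vc$vcov)
}

# For each sample size, draw repeated random subsamples of subjects
# and recompute the ICC on each subsample.
subsample_icc <- function(long_df, n, iterations = 100) {
  sapply(seq_len(iterations), function(i) {
    subs <- sample(unique(long_df$sub_id), n)
    calc_icc(filter(long_df, sub_id %in% subs))
  })
}
```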

source('/Users/zeynepenkavi/Dropbox/PoldrackLab/SRO_DDM_Analyses/code/workspace_scripts/ddm_reldf_sample_size.R')
Warning: Column `dv` joining factor and character vector, coercing into
character vector

Does mean reliability change with sample size?

Yes. The larger the sample size, the more reliable a given measure is on average. The largest increase in reliability comes when moving from 25 to 50 subjects. This is important because many studies using these measures have sample sizes <50 per group.

fig_name = 'rel_by_samplesize.jpeg'

knitr::include_graphics(paste0(fig_path, fig_name))

When fewer than 15 subjects are used to calculate the measures, they are significantly less reliable.

summary(lmer(icc ~ sample_size + (1|dv) + (1|iteration), rel_df_sample_size))
Linear mixed model fit by REML ['lmerMod']
Formula: icc ~ sample_size + (1 | dv) + (1 | iteration)
   Data: rel_df_sample_size

REML criterion at convergence: 2823504

Scaled residuals: 
   Min     1Q Median     3Q    Max 
-605.6    0.0    0.0    0.0    0.4 

Random effects:
 Groups    Name        Variance Std.Dev.
 dv        (Intercept)  0.06526 0.2555  
 iteration (Intercept)  0.00427 0.0653  
 Residual              65.64977 8.1025  
Number of obs: 402034, groups:  dv, 505; iteration, 100

Fixed effects:
            Estimate Std. Error t value
(Intercept) 0.311251   0.024771   12.57
sample_size 0.001557   0.000316    4.92

Correlation of Fixed Effects:
            (Intr)
sample_size -0.673

Are there differences between any other sample sizes? This ignores the differences between variables, but there seem to be differences only between n=10 and all larger sample sizes.

with(rel_df_sample_size_summary, pairwise.t.test(mean_icc, sample_size, p.adjust.method = "bonferroni"))

    Pairwise comparisons using t tests with pooled SD 

data:  mean_icc and sample_size 

    10    15 20 25 50 75 100
15  2e-04 -  -  -  -  -  -  
20  9e-06 1  -  -  -  -  -  
25  2e-06 1  1  -  -  -  -  
50  9e-08 1  1  1  -  -  -  
75  4e-08 1  1  1  1  -  -  
100 3e-08 1  1  1  1  1  -  
125 2e-08 1  1  1  1  1  1  

P value adjustment method: bonferroni 

Does the change in reliability with sample size vary by variable type?

The changes do not differ between raw and ddm measures. They do differ by whether the measure is a contrast measure: for contrast measures, all increases in reliability with sample size are larger than the increases for non-contrast measures. This implies that larger sample sizes are even more crucial for studies using contrast measures as trait-level measures.

tmp = rel_df_sample_size %>% 
  group_by(sample_size, ddm_raw, overall_difference) %>%
  summarise(mean_icc = mean(icc, na.rm=T),
            sem_icc = sem(icc), 
            mean_var_subs_pct = mean(var_subs_pct, na.rm=T),
            sem_var_subs_pct = sem(var_subs_pct),
            mean_var_ind_pct = mean(var_ind_pct, na.rm=T),
            sem_var_ind_pct = sem(var_ind_pct),
            mean_var_resid_pct = mean(var_resid_pct, na.rm=T),
            sem_var_resid_pct = sem(var_resid_pct)) %>%
  na.exclude()
rel_df_sample_size_summary %>%
  filter(!is.na(overall_difference)) %>%
  ggplot(aes(factor(sample_size), mean_icc))+
  geom_line(aes(group = dv, color=ddm_raw), alpha = 0.1)+
  geom_line(data = tmp, aes(factor(sample_size),mean_icc, color=ddm_raw, group=ddm_raw))+
  geom_point(data = tmp, aes(factor(sample_size),mean_icc, color=ddm_raw))+
  geom_errorbar(data = tmp,aes(ymin=mean_icc-sem_icc, ymax = mean_icc+sem_icc, color=ddm_raw), width = 0.1)+
  facet_wrap(~overall_difference)+
  ylab("Mean reliability of 100 samples of size n")+
  xlab("Sample size")+
  theme(legend.title = element_blank(),
        legend.position = "bottom")+
  ylim(-1,1)
Warning: Removed 122 rows containing missing values (geom_path).

summary(lmer(icc ~ sample_size * overall_difference + (1|dv) + (1|iteration), rel_df_sample_size))
Linear mixed model fit by REML ['lmerMod']
Formula: 
icc ~ sample_size * overall_difference + (1 | dv) + (1 | iteration)
   Data: rel_df_sample_size

REML criterion at convergence: 2805380

Scaled residuals: 
   Min     1Q Median     3Q    Max 
-603.4    0.0    0.0    0.0    0.3 

Random effects:
 Groups    Name        Variance Std.Dev.
 dv        (Intercept)  0.04637 0.2153  
 iteration (Intercept)  0.00433 0.0658  
 Residual              66.14275 8.1328  
Number of obs: 399034, groups:  dv, 499; iteration, 100

Fixed effects:
                                         Estimate Std. Error t value
(Intercept)                              0.518096   0.044356   11.68
sample_size                              0.000621   0.000602    1.03
overall_differencecontrast              -0.459839   0.065919   -6.98
overall_differencecondition             -0.211001   0.054844   -3.85
sample_size:overall_differencecontrast   0.001230   0.000905    1.36
sample_size:overall_differencecondition  0.001343   0.000753    1.78

Correlation of Fixed Effects:
                       (Intr) smpl_s ovrll_dffrnccnt ovrll_dffrnccnd
sample_size            -0.713                                       
ovrll_dffrnccnt        -0.658  0.480                                
ovrll_dffrnccnd        -0.791  0.577  0.532                         
smpl_sz:vrll_dffrnccnt  0.475 -0.665 -0.721          -0.384         
smpl_sz:vrll_dffrnccnd  0.571 -0.800 -0.384          -0.721         
                       smpl_sz:vrll_dffrnccnt
sample_size                                  
ovrll_dffrnccnt                              
ovrll_dffrnccnd                              
smpl_sz:vrll_dffrnccnt                       
smpl_sz:vrll_dffrnccnd  0.532                

Does variability of reliability change with sample size?

No.

rel_df_sample_size_summary %>%
  na.exclude() %>%
  ggplot(aes(sample_size, sem_icc))+
  geom_line(aes(group = dv, color=ddm_raw), alpha = 0.1)+
  geom_line(data = tmp, aes(sample_size,sem_icc, color=ddm_raw))+
  geom_point(data = tmp, aes(sample_size,sem_icc, color=ddm_raw))+
  facet_wrap(~overall_difference)+
  ylab("Standard error of mean of reliability \n of 100 samples of size n")+
  xlab("Sample size")+
  theme(legend.title = element_blank(),
        legend.position = "bottom")

summary(lmer(sem_icc ~ sample_size * overall_difference + (1|dv), rel_df_sample_size_summary))
Linear mixed model fit by REML ['lmerMod']
Formula: sem_icc ~ sample_size * overall_difference + (1 | dv)
   Data: rel_df_sample_size_summary

REML criterion at convergence: 9708

Scaled residuals: 
   Min     1Q Median     3Q    Max 
 -0.12  -0.06  -0.02   0.01  60.38 

Random effects:
 Groups   Name        Variance Std.Dev.
 dv       (Intercept) 0.000213 0.0146  
 Residual             0.657690 0.8110  
Number of obs: 3992, groups:  dv, 499

Fixed effects:
                                         Estimate Std. Error t value
(Intercept)                              0.038630   0.039761    0.97
sample_size                             -0.000334   0.000600   -0.56
overall_differencecontrast               0.048118   0.059791    0.80
overall_differencecondition              0.090710   0.049734    1.82
sample_size:overall_differencecontrast  -0.000466   0.000902   -0.52
sample_size:overall_differencecondition -0.000986   0.000750   -1.31

Correlation of Fixed Effects:
                       (Intr) smpl_s ovrll_dffrnccnt ovrll_dffrnccnd
sample_size            -0.792                                       
ovrll_dffrnccnt        -0.665  0.527                                
ovrll_dffrnccnd        -0.799  0.633  0.532                         
smpl_sz:vrll_dffrnccnt  0.527 -0.665 -0.792          -0.421         
smpl_sz:vrll_dffrnccnd  0.633 -0.799 -0.421          -0.792         
                       smpl_sz:vrll_dffrnccnt
sample_size                                  
ovrll_dffrnccnt                              
ovrll_dffrnccnd                              
smpl_sz:vrll_dffrnccnt                       
smpl_sz:vrll_dffrnccnd  0.532                

Does between subjects variance change with sample size?

Yes. Between subjects variance decreases with sample size. This is more pronounced for non-contrast measures.

This goes against my intuitions. Looking at the change in individual measures' between subjects variance percentage, there seems to be a lot of inter-measure variance (more pronounced below for within subjects variance). I'm not sure whether anything is common to the measures that show increasing between subjects variability with sample size and separates them from those that show decreasing between subjects variability with sample size (the slight majority).

rel_df_sample_size_summary %>%
  na.exclude() %>%
  ggplot(aes(sample_size, mean_var_subs_pct))+
  geom_line(aes(group = dv, color=ddm_raw), alpha = 0.1)+
  geom_line(data = tmp, aes(sample_size,mean_var_subs_pct, color=ddm_raw))+
  geom_point(data = tmp, aes(sample_size,mean_var_subs_pct, color=ddm_raw))+
  facet_wrap(~overall_difference)+
  ylab("Mean percentage of \n between subjects variance \n of 100 samples of size n")+
  xlab("Sample size")+
  theme(legend.title = element_blank(),
        legend.position = "bottom")

summary(lmer(var_subs_pct ~ sample_size * overall_difference + (1|dv) + (1|iteration), rel_df_sample_size))
Linear mixed model fit by REML ['lmerMod']
Formula: var_subs_pct ~ sample_size * overall_difference + (1 | dv) +  
    (1 | iteration)
   Data: rel_df_sample_size

REML criterion at convergence: 3274176

Scaled residuals: 
   Min     1Q Median     3Q    Max 
-4.957 -0.705  0.054  0.704  4.556 

Random effects:
 Groups    Name        Variance Std.Dev.
 dv        (Intercept) 109.05   10.44   
 iteration (Intercept)   1.26    1.12   
 Residual              212.51   14.58   
Number of obs: 399034, groups:  dv, 499; iteration, 100

Fixed effects:
                                         Estimate Std. Error t value
(Intercept)                              58.88947    0.89254    66.0
sample_size                              -0.08360    0.00108   -77.5
overall_differencecontrast              -14.86635    1.33152   -11.2
overall_differencecondition              -5.22575    1.10755    -4.7
sample_size:overall_differencecontrast    0.07402    0.00162    45.6
sample_size:overall_differencecondition   0.03666    0.00135    27.2

Correlation of Fixed Effects:
                       (Intr) smpl_s ovrll_dffrnccnt ovrll_dffrnccnd
sample_size            -0.064                                       
ovrll_dffrnccnt        -0.660  0.043                                
ovrll_dffrnccnd        -0.793  0.051  0.532                         
smpl_sz:vrll_dffrnccnt  0.042 -0.665 -0.064          -0.034         
smpl_sz:vrll_dffrnccnd  0.051 -0.800 -0.034          -0.064         
                       smpl_sz:vrll_dffrnccnt
sample_size                                  
ovrll_dffrnccnt                              
ovrll_dffrnccnd                              
smpl_sz:vrll_dffrnccnt                       
smpl_sz:vrll_dffrnccnd  0.532                

Does within subjects variance change with sample size?

Yes. This again goes against my intuition, but here the inter-measure differences are even more pronounced. Some measures show a larger change between the two measurement time points the more subjects are tested, while others show a smaller decrease in within subjects variance with larger sample sizes. I still don't know whether anything distinguishes these two types of measures.

rel_df_sample_size_summary %>%
  na.exclude() %>%
  ggplot(aes(sample_size, mean_var_ind_pct))+
  geom_line(aes(group = dv, color=ddm_raw), alpha = 0.1)+
  geom_line(data = tmp, aes(sample_size,mean_var_ind_pct, color=ddm_raw))+
  geom_point(data = tmp, aes(sample_size,mean_var_ind_pct, color=ddm_raw))+
  facet_wrap(~overall_difference)+
  ylab("Mean percentage of \n within subjects variance \n of 100 samples of size n")+
  xlab("Sample size")+
  theme(legend.title = element_blank(),
        legend.position = "bottom")

summary(lmer(var_ind_pct ~ sample_size * overall_difference + (1|dv) + (1|iteration), rel_df_sample_size))
Linear mixed model fit by REML ['lmerMod']
Formula: var_ind_pct ~ sample_size * overall_difference + (1 | dv) + (1 |  
    iteration)
   Data: rel_df_sample_size

REML criterion at convergence: 3471611

Scaled residuals: 
   Min     1Q Median     3Q    Max 
-4.539 -0.749 -0.171  0.698  4.331 

Random effects:
 Groups    Name        Variance Std.Dev.
 dv        (Intercept) 132.28   11.50   
 iteration (Intercept)   1.41    1.19   
 Residual              348.71   18.67   
Number of obs: 399034, groups:  dv, 499; iteration, 100

Fixed effects:
                                        Estimate Std. Error t value
(Intercept)                             18.28380    0.98355    18.6
sample_size                              0.11368    0.00138    82.2
overall_differencecontrast               5.17440    1.46817     3.5
overall_differencecondition              2.01577    1.22121     1.7
sample_size:overall_differencecontrast  -0.09479    0.00208   -45.6
sample_size:overall_differencecondition -0.03786    0.00173   -21.9

Correlation of Fixed Effects:
                       (Intr) smpl_s ovrll_dffrnccnt ovrll_dffrnccnd
sample_size            -0.074                                       
ovrll_dffrnccnt        -0.660  0.050                                
ovrll_dffrnccnd        -0.794  0.060  0.532                         
smpl_sz:vrll_dffrnccnt  0.049 -0.665 -0.074          -0.040         
smpl_sz:vrll_dffrnccnd  0.059 -0.800 -0.040          -0.074         
                       smpl_sz:vrll_dffrnccnt
sample_size                                  
ovrll_dffrnccnt                              
ovrll_dffrnccnd                              
smpl_sz:vrll_dffrnccnt                       
smpl_sz:vrll_dffrnccnd  0.532                

Does residual variance change with sample size?

rel_df_sample_size_summary %>%
  na.exclude() %>%
  ggplot(aes(sample_size, mean_var_resid_pct))+
  geom_line(aes(group = dv, color=ddm_raw), alpha = 0.1)+
  geom_line(data = tmp, aes(sample_size,mean_var_resid_pct, color=ddm_raw))+
  geom_point(data = tmp, aes(sample_size,mean_var_resid_pct, color=ddm_raw))+
  facet_wrap(~overall_difference)+
  ylab("Mean percentage of residual variance \n of 100 samples of size n")+
  xlab("Sample size")+
  theme(legend.title = element_blank(),
        legend.position = "bottom")

summary(lmer(var_resid_pct ~ sample_size * overall_difference + (1|dv) + (1|iteration), rel_df_sample_size))
Linear mixed model fit by REML ['lmerMod']
Formula: var_resid_pct ~ sample_size * overall_difference + (1 | dv) +  
    (1 | iteration)
   Data: rel_df_sample_size

REML criterion at convergence: 2942570

Scaled residuals: 
   Min     1Q Median     3Q    Max 
-4.562 -0.624 -0.040  0.559  8.279 

Random effects:
 Groups    Name        Variance Std.Dev.
 dv        (Intercept) 39.93    6.319   
 iteration (Intercept)  0.44    0.663   
 Residual              92.59    9.623   
Number of obs: 399034, groups:  dv, 499; iteration, 100

Fixed effects:
                                         Estimate Std. Error t value
(Intercept)                             22.826752   0.540234    42.3
sample_size                             -0.030083   0.000712   -42.2
overall_differencecontrast               9.691933   0.806229    12.0
overall_differencecondition              3.209963   0.670616     4.8
sample_size:overall_differencecontrast   0.020769   0.001071    19.4
sample_size:overall_differencecondition  0.001197   0.000891     1.3

Correlation of Fixed Effects:
                       (Intr) smpl_s ovrll_dffrnccnt ovrll_dffrnccnd
sample_size            -0.069                                       
ovrll_dffrnccnt        -0.660  0.046                                
ovrll_dffrnccnd        -0.793  0.056  0.532                         
smpl_sz:vrll_dffrnccnt  0.046 -0.665 -0.070          -0.037         
smpl_sz:vrll_dffrnccnd  0.055 -0.800 -0.037          -0.070         
                       smpl_sz:vrll_dffrnccnt
sample_size                                  
ovrll_dffrnccnt                              
ovrll_dffrnccnd                              
smpl_sz:vrll_dffrnccnt                       
smpl_sz:vrll_dffrnccnd  0.532                

Conclusion: Larger samples are better for reliability but not necessarily always for the same reasons; for some variables this is due to increasing between subjects variance while for others it’s due to decreasing residual variance (?).

rm(rel_df_sample_size, rel_df_sample_size_summary)

Hierarchical estimation consequences

The H in HDDM estimates stands for ‘hierarchical’, denoting the fact that the distribution of the whole sample is incorporated in the priors for the model parameters. This differs from, e.g., EZ-diffusion parameters, which would be the same for a given subject’s data regardless of the rest of the sample. One can ask whether this H indeed has a measurable impact.

HDDM and EZ parameter estimates might differ due to other differences in the estimation process as well. Therefore, to evaluate whether the ‘H’ leads to meaningful changes in parameters, we compare the same models fit on the same data using either the whole sample for the hierarchical structure or only the single subject’s data. The latter estimates are referred to as ‘flat’ estimates.
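The shrinkage the hierarchy induces can be illustrated with a minimal sketch. This is simulated data with lme4 as a stand-in for HDDM, not the report's actual pipeline: per-subject 'flat' estimates use only that subject's data, while hierarchical estimates are pulled toward the group mean.

```r
# Minimal shrinkage illustration (simulated data; lme4 stands in for HDDM)
library(lme4)

set.seed(1)
n_subs = 30; n_trials = 10
true_means = rnorm(n_subs, mean = 0, sd = 1)
sim = data.frame(sub_id = factor(rep(1:n_subs, each = n_trials)),
                 y = rnorm(n_subs * n_trials,
                           mean = rep(true_means, each = n_trials), sd = 2))

# 'Flat' analogue: each subject estimated from their own data only
flat_est = tapply(sim$y, sim$sub_id, mean)

# 'Hierarchical' analogue: subject intercepts shrunk toward the group mean
hier_est = coef(lmer(y ~ 1 + (1 | sub_id), data = sim))$sub_id[, "(Intercept)"]

# Shrinkage makes the hierarchical estimates less dispersed than the flat ones
sd(hier_est) < sd(flat_est)
```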

retest_hddm_flat = read.csv(paste0(retest_data_path,'retest_hddm_flat.csv'))

retest_hddm_flat = retest_hddm_flat %>% rename(sub_id = subj_id)

test_hddm_flat = read.csv(paste0(retest_data_path,'/t1_data/t1_hddm_flat.csv'))

test_hddm_flat = test_hddm_flat %>% rename(sub_id = subj_id)

# numeric_cols = get_numeric_cols()

# Check if all the variables are there (no for now)
# sum(names(retest_hddm_flat) %in% numeric_cols) == length(names(retest_hddm_flat))
# names(retest_hddm_flat)[which(names(retest_hddm_flat) %in% numeric_cols == FALSE)]

# sum(names(test_hddm_flat) %in% numeric_cols) == length(names(test_hddm_flat))
# names(test_hddm_flat)[which(names(test_hddm_flat) %in% numeric_cols == FALSE)]

Parameter estimate

Plot the percent change in raw parameters for retest and t1 for each of the 3 parameters (using the hierarchical estimate as baseline: by what percent does the flat parameter differ from the hierarchical one?).

For both time points, the thresholds and non-decision times are very similar regardless of whether they are estimated hierarchically or without the hierarchy (differences peaking at 0).

This is also true for the majority of the drift rates. There are, however, almost as many drift rates that change completely when estimated without the hierarchy.

#Should not be necessary later
common_cols = names(retest_hddm_flat)[names(retest_hddm_flat) %in% names(retest_data)]
common_cols=common_cols[common_cols %in% names(test_hddm_flat)]

retest_hddm_hier = retest_data %>% select(common_cols) %>% mutate(hddm="hierarchical")
test_hddm_hier = test_data %>% select(common_cols)  %>% mutate(hddm="hierarchical")
retest_hddm_flat = retest_hddm_flat %>% select(common_cols) %>% mutate(hddm="flat")
test_hddm_flat = test_hddm_flat %>% select(common_cols) %>% mutate(hddm="flat")

retest_flat_difference = rbind(retest_hddm_hier, retest_hddm_flat)
retest_flat_difference = retest_flat_difference %>%
  gather(dv, value, -sub_id, -hddm) %>%
  spread(hddm, value) %>%
  mutate(diff_pct = (hierarchical - flat)/hierarchical*100,
         diff_pct = ifelse(diff_pct<(-100), -100, ifelse(diff_pct>100, 100, diff_pct)),
         time = "retest",
         par = ifelse(grepl("drift", dv), "drift", ifelse(grepl("thresh", dv), "thresh", ifelse(grepl("non_decision", dv), "non_decision", NA))))

test_flat_difference = rbind(test_hddm_hier, test_hddm_flat)
test_flat_difference = test_flat_difference %>%
  gather(dv, value, -sub_id, -hddm) %>%
  spread(hddm, value) %>%
  mutate(diff_pct = (hierarchical - flat)/hierarchical*100,
         diff_pct = ifelse(diff_pct<(-100), -100, ifelse(diff_pct>100, 100, diff_pct)),
         time = "test",
         par = ifelse(grepl("drift", dv), "drift", ifelse(grepl("thresh", dv), "thresh", ifelse(grepl("non_decision", dv), "non_decision", NA))))

flat_difference = rbind(test_flat_difference, retest_flat_difference)

flat_difference %>%
  ggplot(aes(diff_pct))+
  geom_histogram()+
  facet_grid(factor(time, levels = c("test", "retest"),labels = c("test", "retest")) ~ par, scales="free")
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Warning: Removed 1022 rows containing non-finite values (stat_bin).

Is the average percentage change different than 0? No.

This model uses the distribution of percentage differences in parameter estimates for drift rates as the baseline. A more appropriate model would test whether all three distributions differ from 0. The conclusion should not change: the intercept suggests that the mean of the drift-rate difference distribution is at 0, and this does not differ from the distributions for either of the other parameters at either time point.
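A sketch of that cell-means alternative (assuming `flat_difference` and lme4 are in memory, as above): dropping the intercept yields one fixed-effect mean per time-by-parameter cell, and each t value then tests that cell's mean percentage difference against 0.

```r
# Hypothetical cell-means formulation: one mean per time x parameter cell,
# each tested against 0 rather than against a drift-rate baseline
cell_means = lmer(diff_pct ~ 0 + time:par + (1 | dv), data = flat_difference)
summary(cell_means)
```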

summary(lmer(diff_pct ~ time*par+(1|dv), flat_difference %>% mutate(pars = factor(par, levels = c("thresh", "drift", "non_decision")))))
Linear mixed model fit by REML ['lmerMod']
Formula: diff_pct ~ time * par + (1 | dv)
   Data: 
flat_difference %>% mutate(pars = factor(par, levels = c("thresh",  
    "drift", "non_decision")))

REML criterion at convergence: 238677

Scaled residuals: 
   Min     1Q Median     3Q    Max 
-3.244 -0.258 -0.016  0.213  3.853 

Random effects:
 Groups   Name        Variance Std.Dev.
 dv       (Intercept)  130     11.4    
 Residual             1443     38.0    
Number of obs: 23578, groups:  dv, 82

Fixed effects:
                         Estimate Std. Error t value
(Intercept)                 -1.19       1.64   -0.73
timetest                     1.64       0.62    2.65
parnon_decision              1.65       3.57    0.46
parthresh                   -1.84       3.39   -0.54
timetest:parnon_decision    -1.38       1.35   -1.02
timetest:parthresh          -1.71       1.28   -1.33

Correlation of Fixed Effects:
            (Intr) timtst prnn_d prthrs tmts:_
timetest    -0.191                            
parnon_dcsn -0.460  0.088                     
parthresh   -0.485  0.093  0.223              
tmtst:prnn_  0.088 -0.459 -0.191 -0.043       
tmtst:prthr  0.092 -0.483 -0.042 -0.192  0.222

Do people change in the same way? Mostly.

Plotting the raw parameter estimate obtained hierarchically against the estimate obtained without the hierarchy. Red lines are the 45-degree line. The fact that many points cluster around this line, with no systematic deviations from it for any parameter at either time point, suggests that the hierarchical and flat estimates are mostly the same.

flat_difference %>%
  ggplot(aes(hierarchical, flat))+
  geom_point()+
  geom_abline(aes(intercept=0, slope=1), color="red")+
  facet_grid(factor(time, levels = c("test", "retest"),labels = c("test", "retest")) ~ par)
Warning: Removed 1022 rows containing missing values (geom_point).

Parameter reliability

Plotting the reliability estimates for flat parameters against the reliability of hierarchical estimates. Red lines are 45-degree lines. While there are some changes in reliability, none appears very large, systematic, or consequential (i.e., neither estimation method pushes reliability to an acceptable level that the other does not).

rel_df_flat = make_rel_df(t1_df = test_hddm_flat, t2_df = retest_hddm_flat, metrics = c('icc', 'pearson', 'var_breakdown'))

rel_df_flat = rel_df_flat %>%
  left_join(rel_df[,c("dv", "icc", "rt_acc", "overall_difference")], by = "dv") 

rel_df_flat %>%
  ggplot(aes(icc.y, icc.x))+
  geom_point()+
  geom_abline(aes(slope=1, intercept = 0), color="red")+
  facet_wrap(~rt_acc)+
  xlab("Hierarchical reliability")+
  ylab("Flat reliability")

with(rel_df_flat %>% filter(rt_acc == "drift rate"), t.test(icc.x, icc.y, paired=T))

    Paired t-test

data:  icc.x and icc.y
t = 0.53, df = 51, p-value = 0.6
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.02018  0.03454
sample estimates:
mean of the differences 
               0.007181 
with(rel_df_flat %>% filter(rt_acc == "threshold"), t.test(icc.x, icc.y, paired=T))

    Paired t-test

data:  icc.x and icc.y
t = -2, df = 15, p-value = 0.06
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.088179  0.002222
sample estimates:
mean of the differences 
               -0.04298 
with(rel_df_flat %>% filter(rt_acc == "non-decision"), t.test(icc.x, icc.y, paired=T))

    Paired t-test

data:  icc.x and icc.y
t = -0.19, df = 13, p-value = 0.9
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.04691  0.03948
sample estimates:
mean of the differences 
              -0.003715 
rm(flat_difference, rel_df_flat, retest_flat_difference, retest_hddm_flat, retest_hddm_hier, test_flat_difference,  test_hddm_flat, test_hddm_hier)

Parameter fit

Do the fit statistics differ by whether the model was hierarchical or not? No.

t1_hierarchical_fitstats = t1_hierarchical_fitstats %>%
  filter(subj_id %in% retest_hierarchical_fitstats$subj_id)

tmp = rbind(retest_hierarchical_fitstats, retest_flat_fitstats, t1_hierarchical_fitstats, t1_flat_fitstats) %>%
  separate(sample, c("time", "proc"), sep="_") %>%
  select(log_rsq_adj, time, proc, subj_id, task_name) %>%
  spread(proc, log_rsq_adj) 

tmp %>%
  ggplot(aes(hierarchical, flat))+
  geom_point()+
  geom_abline(aes(slope=1, intercept=0), color="red")+
  facet_wrap(~time)
Warning: Removed 271 rows containing missing values (geom_point).

with(tmp %>% filter(time == "t1"), cor.test(flat, hierarchical))

    Pearson's product-moment correlation

data:  flat and hierarchical
t = 59, df = 1900, p-value <2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.7855 0.8174
sample estimates:
   cor 
0.8021 
with(tmp %>% filter(time == "retest"), cor.test(flat, hierarchical))

    Pearson's product-moment correlation

data:  flat and hierarchical
t = 62, df = 2000, p-value <2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.7942 0.8244
sample estimates:
   cor 
0.8098 
with(tmp %>% filter(time == "t1"), t.test(flat, hierarchical, paired=TRUE))

    Paired t-test

data:  flat and hierarchical
t = 0.16, df = 1900, p-value = 0.9
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.0007680  0.0009064
sample estimates:
mean of the differences 
             0.00006916 
with(tmp %>% filter(time == "retest"), t.test(flat, hierarchical, paired=TRUE))

    Paired t-test

data:  flat and hierarchical
t = 1.2, df = 2000, p-value = 0.2
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.0002551  0.0010165
sample estimates:
mean of the differences 
              0.0003807 

Clustering

– Do DDM parameters capture processes similar to the raw measures in a given task, or do they capture processes that are more similar across tasks? (If the former, they would be less useful than if the latter.)

This could be analyzed with factor analysis, but there are more variables than observations, so as a first pass we explore correlations.

For this we correlate each DDM measure with:
1. raw measures within task
2. other ddm measures across tasks

(Absolute) correlations between raw and ddm measures within a task are higher than those between ddm measures across tasks.
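The comparison described above can be sketched as follows. Everything here is an assumption about the data layout, not the report's actual code: it presumes `test_data` holds `sub_id` plus only numeric measures named in a `task.measure` scheme, with DDM measures identifiable by 'hddm' or 'EZ' in the name.

```r
# Sketch: mean |r| of each DDM measure with (a) raw measures within its task
# and (b) DDM measures from other tasks (column naming scheme is assumed)
library(dplyr)

cors = cor(test_data %>% select(-sub_id), use = "pairwise.complete.obs")

cor_long = as.data.frame(as.table(cors), stringsAsFactors = FALSE) %>%
  rename(dv1 = Var1, dv2 = Var2, r = Freq) %>%
  filter(dv1 != dv2) %>%
  mutate(task1 = sub("\\..*$", "", dv1),  # assumes 'task.measure' naming
         task2 = sub("\\..*$", "", dv2),
         ddm1 = grepl("hddm|EZ", dv1),
         ddm2 = grepl("hddm|EZ", dv2))

cor_long %>%
  filter(ddm1) %>%
  mutate(comparison = case_when(
    task1 == task2 & !ddm2 ~ "raw, within task",
    task1 != task2 & ddm2  ~ "ddm, across tasks",
    TRUE ~ NA_character_)) %>%
  filter(!is.na(comparison)) %>%
  group_by(comparison) %>%
  summarise(mean_abs_r = mean(abs(r), na.rm = TRUE))
```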

Factor analysis

– Are these clusters more reliable than using either the raw or the DDM measures alone?

Prediction

– Do raw or DDM measures (or factor scores) predict real world outcomes better?

To Do

  • Alternate DDMs
  • Time and accuracy costs vs DDMs (Hedge et al., 2018, Psych Bull)
  • Trial by trial variability as individual difference measures (Rouder and Haaf, PsyArXiv)
---
title: 'Self Regulation Ontology DDM Analyses'
output:
  github_document:
    toc: yes
    toc_float: yes
---

```{r, message=FALSE, warning=FALSE, include=FALSE}
source('/Users/zeynepenkavi/Dropbox/PoldrackLab/SRO_DDM_Analyses/code/workspace_scripts/SRO_DDM_Analyses_Workspace.R')
```

The tasks included in this report are:  

```{r}
unique(measure_labels$task_group)
```

## T1 HDDM parameters

Before going on with the rest of the report, we first decide whether there are significant differences in the distributions of the HDDM parameter estimates and their reliabilities depending on whether they are fit on the full sample (n=552) or the retest sample (n=150) for t1 data.  

### Parameter stability

Differences in distributions: using scaled differences

```{r echo=FALSE, out.width='100%'}
fig_name = 'HDDM_par_diff.jpeg'

knitr::include_graphics(paste0(fig_path, fig_name))
```

Does the distribution of scaled difference scores (between using n=150 or n=552) have a mean different from 0, allowing for random effects across parameters to account for the different parameter types? No.

```{r}
if(!exists('hddm_pars')){
  source('/Users/zeynepenkavi/Dropbox/PoldrackLab/SRO_DDM_Analyses/code/workspace_scripts/hddm_pars_data.R')
}
summary(lmer(scaled_diff ~ rt_acc + (1|dv), hddm_pars))
# Same result
# summary(MCMCglmm(scaled_diff ~ rt_acc, random = ~ dv, data=hddm_pars))
```

### Parameter reliability

Are there differences in reliability depending on which sample the HDDM parameters are estimated from? No.

```{r echo=FALSE, out.width='100%'}
fig_name = 'HDDM_rel_150vs552.jpeg'

knitr::include_graphics(paste0(fig_path, fig_name))
```

Do HDDM parameters differ in their reliability depending on whether they are derived from n=150 or n=552? No.

```{r}
if(!exists('hddm_rels')){
  source('/Users/zeynepenkavi/Dropbox/PoldrackLab/SRO_DDM_Analyses/code/workspace_scripts/hddm_rel_data.R')}

hddm_rels = hddm_rels %>%
  gather(sample, value, -dv, -rt_acc)

summary(lmer(value ~ sample*rt_acc + (1|dv), hddm_rels))
```

### Parameter fit

Do HDDM parameter estimates fit participant data better if the model prior is informed by a larger sample? To assess model fit, we sampled from the posterior predictive for each subject and regressed the predicted RTs on the actual RTs (using both raw and log-transformed RTs). Our measure of fit is the (adjusted) $R^2$ from these regressions for each participant.
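The fit statistic just described can be sketched as follows; `obs_rt` and `pred_rt` are hypothetical toy vectors, not the report's pipeline.

```r
# Adjusted R^2 fit statistic for one subject: regress posterior predictive
# RTs on observed RTs, on the raw and the log scale (toy data)
fit_stats = function(obs_rt, pred_rt) {
  c(rsq_adj     = summary(lm(pred_rt ~ obs_rt))$adj.r.squared,
    log_rsq_adj = summary(lm(log(pred_rt) ~ log(obs_rt)))$adj.r.squared)
}

set.seed(2)
obs_rt  = rlnorm(200, meanlog = -0.5, sdlog = 0.3)
pred_rt = obs_rt * exp(rnorm(200, 0, 0.1))  # stand-in posterior predictive draw
fit_stats(obs_rt, pred_rt)
```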

Here we compare fits of the same model with a prior informed by n=552 vs. n=150.

*Note:* In calculating the fit statistics for the full sample (n=552) hierarchical models we used fewer samples from the posterior predictive due to computational constraints. We did, however, confirm that the number of samples from the posterior predictive did not have an effect on the fit statistics. Details of those analyses can be found [here](https://zenkavi.github.io/SRO_DDM_Analyses/output/reports/NumSamplesAnalysis.nb.html).

```{r message=FALSE}
refit_fitstats %>%
  left_join(t1_hierarchical_fitstats, by = c("subj_id", "task_name")) %>%
  ggplot(aes(log_rsq_adj.x, log_rsq_adj.y))+
  geom_point(aes(color=task_name))+
  xlab("Adjusted r-sq for n=150")+
  ylab("Adjusted r-sq for n=552")+
  theme(legend.position = "none")+
  geom_abline(slope = 1, intercept=0)
```

The fit statistics (looking here at the adjusted $R^2$s using log-transformed RTs) for the same models using n=552 and n=150 for the priors are highly correlated.

```{r}
tmp = refit_fitstats %>%
  left_join(t1_hierarchical_fitstats, by = c("subj_id", "task_name"))

with(tmp, cor.test(log_rsq_adj.x, log_rsq_adj.y))
```

And they do not systematically differ from each other.

```{r}
with(tmp, t.test(log_rsq_adj.x, log_rsq_adj.y, paired=T))
```

*Conclusion:* Parameter estimates from refitting the t1 data using only the retest sample do not differ in any detectable way from estimates using the full t1 sample. They are also, in principle, more comparable to the parameter estimates from t2 and are therefore used in the rest of this report.


```{r echo=FALSE}
#Workspace clean up
rm(tmp)

rm(hddm_pars, hddm_rels)

boot_df = fullfit_boot_df %>%
  filter(dv %in% unique(refit_boot_df$dv) == FALSE)

boot_df = rbind(boot_df, refit_boot_df)

rm(fullfit_boot_df, refit_boot_df)

rel_df = rel_df_refit

rm(rel_df_fullfit, rel_df_refit)

test_data = test_data_hddm_refit

rm(test_data_hddm_fullfit, test_data_hddm_refit)
```

## DDM vs raw reliability overall

Plot reliability point estimates comparing DDM measures to raw measures faceting for contrast measures.

```{r}
fig_name = 'ddmvsraw_point.jpeg'

knitr::include_graphics(paste0(fig_path, fig_name))
```

Plot averaged bootstrapped reliability estimates per measure comparing DDM measures to raw measures faceting for contrast measures.

```{r}
fig_name = 'ddmvsraw_boot.jpeg'

knitr::include_graphics(paste0(fig_path, fig_name))
```

Model testing if the reliability of raw measures differs from that of ddm estimates and if contrast measures differ from non-contrast measures.

Checking whether both fixed effects (raw vs. ddm and contrast vs. non-contrast) as well as their interaction are necessary.

Conclusion: Interactive model is best.

```{r}
mer1 = lmer(icc ~ ddm_raw + (1|dv), boot_df %>% filter(rt_acc != "other"))
mer1a = lmer(icc ~ overall_difference + (1|dv), boot_df %>% filter(rt_acc != "other"))
mer2 = lmer(icc ~ ddm_raw + overall_difference + (1|dv), boot_df %>% filter(rt_acc != "other"))
mer3 = lmer(icc ~ ddm_raw * overall_difference + (1|dv), boot_df %>% filter(rt_acc != "other"))
anova(mer1, mer2, mer3)
anova(mer1a, mer2, mer3)
```

```{r}
rm(mer1, mer2, mer1a)
```

Raw measures do not significantly differ from ddm parameters in their reliability, but non-contrast measures are significantly more reliable than contrast and condition measures.

```{r}
summary(mer3)
```

### Best measure for each task

What is the best measure of individual difference for any measure that has both raw and DDM parameters? 

Even though the ddm parameters are not significantly less reliable overall, the most reliable measure is more frequently a raw measure. There are some examples of an EZ estimate being the best for a task as well. Regardless of raw vs. ddm, the best measure is always a non-contrast measure.

```{r}
rel_df %>%
  group_by(task_group) %>%
  filter(icc == max(icc)) %>%
  select(task_group, everything())
```

### Variance breakdown measure types

```{r}
fig_name = 'ddmvsraw_varsubs.jpeg'

knitr::include_graphics(paste0(fig_path, fig_name))
```

```{r}
fig_name = 'ddmvsraw_varresid.jpeg'

knitr::include_graphics(paste0(fig_path, fig_name))
```

Model testing if the percentage of between subjects variance of raw measures differs from that of ddm estimates and if contrast measures differ from non-contrast measures.

Checking whether both fixed effects (raw vs. ddm and contrast vs. non-contrast) as well as their interaction are necessary.

Conclusion: Model with fixed effects for both is best.

```{r}
mer1 = lmer(var_subs_pct ~ ddm_raw + (1|dv), boot_df %>% filter(rt_acc != "other"))
mer1a = lmer(var_subs_pct ~ overall_difference + (1|dv), boot_df %>% filter(rt_acc != "other"))
mer2 = lmer(var_subs_pct ~ ddm_raw +  overall_difference + (1|dv), boot_df %>% filter(rt_acc != "other"))
mer3 = lmer(var_subs_pct ~ ddm_raw *  overall_difference + (1|dv), boot_df %>% filter(rt_acc != "other"))
anova(mer1, mer2, mer3)
anova(mer1a, mer2, mer3)
```

```{r}
rm(mer1, mer1a, mer3)
```

Contrast measures have lower between-subjects variability (i.e., they are worse individual difference measures). Raw and ddm measures do not differ significantly.

```{r}
summary(mer2)
```

Model testing if the percentage of residual variance of raw measures differs from that of ddm estimates and if contrast measures differ from non-contrast measures.

Checking whether both fixed effects (raw vs. ddm and contrast vs. non-contrast) as well as their interaction are necessary.

Conclusion: Interactive model is best.

```{r}
mer1 = lmer(var_resid_pct ~ ddm_raw + (1|dv), boot_df %>% filter(rt_acc != "other"))
mer1a = lmer(var_resid_pct ~ overall_difference + (1|dv), boot_df %>% filter(rt_acc != "other"))
mer2 = lmer(var_resid_pct ~ ddm_raw + overall_difference + (1|dv), boot_df %>% filter(rt_acc != "other"))
mer3 = lmer(var_resid_pct ~ ddm_raw * overall_difference + (1|dv), boot_df %>% filter(rt_acc != "other"))
anova(mer1, mer2, mer3)
anova(mer1a, mer2, mer3)
```

```{r}
rm(mer1, mer1a, mer2)
```

Both contrast and condition measures have higher residual variance. Raw and ddm measures do not differ.

```{r}
summary(mer3)
```

```{r}
rm(mer3)
```

## Sample size effects on reliability

Differences in DDM parameter reliability for t1 data using either n=552 or n=150 were reported above under [T1 HDDM parameters](https://zenkavi.github.io/SRO_DDM_Analyses/output/reports/SRO_DDM_Analyses.nb.html#t1_hddm_parameters). No meaningful differences exist between these two sample sizes.

But even 150 is a large sample size for psychological studies, especially for the forced-choice reaction time tasks included in this report. Here we look at how the reliability of raw and ddm measures changes for sample sizes that are more common in studies using these tasks (25, 50, 75, 100, 125, 150).
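The resampling scheme behind these analyses can be sketched as follows. This is a simplified stand-in (Pearson r instead of the ICC used in the report), and `test_data`, `retest_data`, and the example DV name are assumptions.

```r
# For each sample size n, draw repeated subsamples of subjects and recompute
# a test-retest reliability for one DV (Pearson r as a stand-in for ICC)
subsample_rel = function(t1, t2, dv, n, iters = 100) {
  both = merge(t1[, c("sub_id", dv)], t2[, c("sub_id", dv)], by = "sub_id")
  sapply(seq_len(iters), function(i) {
    idx = sample(nrow(both), n)
    cor(both[idx, 2], both[idx, 3], use = "pairwise.complete.obs")
  })
}

sizes = c(25, 50, 75, 100, 125, 150)
# e.g., mean reliability per sample size for one hypothetical DV name:
# sapply(sizes, function(n)
#   mean(subsample_rel(test_data, retest_data, "stroop.hddm_drift", n)))
```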

*Note:* Not refitting HDDM's for each of these sample sizes since a. there were no differences in parameter stability for n=150 vs 552 and b. a more comprehensive comparison using non-hierarchical estimates and model fit indices will follow. *[Should I revisit this? - 150 and 552 might be too large to lead to changes in parameter estimates but smaller samples that are more common in psych studies might sway estimates more]*

*Note:* Some variables do not have enough variance to calculate reliability for different sample sizes. These variables are:  
>stroop.post_error_slowing  
>simon.std_rt_error  
>shape_matching.post_error_slowing  
>directed_forgetting.post_error_slowing  
>choice_reaction_time.post_error_slowing  
>choice_reaction_time.std_rt_error  
>dot_pattern_expectancy.post_error_slowing  
>motor_selective_stop_signal.go_rt_std_error  
>motor_selective_stop_signal.go_rt_error  
>attention_network_task.post_error_slowing  
>recent_probes.post_error_slowing  
>simon.post_error_slowing  
>dot_pattern_expectancy.BY_errors  

```{r}
source('/Users/zeynepenkavi/Dropbox/PoldrackLab/SRO_DDM_Analyses/code/workspace_scripts/ddm_reldf_sample_size.R')
```

Does the mean reliability change with sample size?

Yes. The larger the sample size, the more reliable a given measure is on average. The largest increase in reliability occurs when moving from 25 to 50 subjects. This is important because many studies using these measures have sample sizes <50 per group.

```{r}
fig_name = 'rel_by_samplesize.jpeg'

knitr::include_graphics(paste0(fig_path, fig_name))
```

When <15 subjects are used to calculate the measures, they are significantly less reliable.

```{r}
summary(lmer(icc ~ sample_size + (1|dv) + (1|iteration), rel_df_sample_size))
```

Are there differences between any other sample sizes? This ignores the differences between variables, but there appear to be differences only between n=10 and all larger sample sizes.

```{r}
with(rel_df_sample_size_summary, pairwise.t.test(mean_icc, sample_size, p.adjust.method = "bonferroni"))
```

Does the change in reliability with sample size vary by variable type?

The changes do not differ between raw and ddm measures. They do differ by whether the measure is a contrast measure: for contrast measures, the increases in reliability with sample size are larger than the increases for non-contrast measures. This implies that larger sample sizes are even more crucial for studies using contrast measures as trait-level measures.

```{r}
tmp = rel_df_sample_size %>% 
  group_by(sample_size, ddm_raw, overall_difference) %>%
  summarise(mean_icc = mean(icc, na.rm=T),
            sem_icc = sem(icc), 
            mean_var_subs_pct = mean(var_subs_pct, na.rm=T),
            sem_var_subs_pct = sem(var_subs_pct),
            mean_var_ind_pct = mean(var_ind_pct, na.rm=T),
            sem_var_ind_pct = sem(var_ind_pct),
            mean_var_resid_pct = mean(var_resid_pct, na.rm=T),
            sem_var_resid_pct = sem(var_resid_pct)) %>%
  na.exclude()
```

```{r}
rel_df_sample_size_summary %>%
  filter(!is.na(overall_difference)) %>%
  ggplot(aes(factor(sample_size), mean_icc))+
  geom_line(aes(group = dv, color=ddm_raw), alpha = 0.1)+
  geom_line(data = tmp, aes(factor(sample_size),mean_icc, color=ddm_raw, group=ddm_raw))+
  geom_point(data = tmp, aes(factor(sample_size),mean_icc, color=ddm_raw))+
  geom_errorbar(data = tmp,aes(ymin=mean_icc-sem_icc, ymax = mean_icc+sem_icc, color=ddm_raw), width = 0.1)+
  facet_wrap(~overall_difference)+
  ylab("Mean reliability of 100 samples of size n")+
  xlab("Sample size")+
  theme(legend.title = element_blank(),
        legend.position = "bottom")+
  ylim(-1,1)
```


```{r}
summary(lmer(icc ~ sample_size * overall_difference + (1|dv) + (1|iteration), rel_df_sample_size))
```

Does variability of reliability change with sample size?

No.

```{r}
rel_df_sample_size_summary %>%
  na.exclude() %>%
  ggplot(aes(sample_size, sem_icc))+
  geom_line(aes(group = dv, color=ddm_raw), alpha = 0.1)+
  geom_line(data = tmp, aes(sample_size,sem_icc, color=ddm_raw))+
  geom_point(data = tmp, aes(sample_size,sem_icc, color=ddm_raw))+
  facet_wrap(~overall_difference)+
  ylab("Standard error of mean of reliability \n of 100 samples of size n")+
  xlab("Sample size")+
  theme(legend.title = element_blank(),
        legend.position = "bottom")
```

```{r}
summary(lmer(sem_icc ~ sample_size * overall_difference + (1|dv), rel_df_sample_size_summary))
```

Does between subjects variance change with sample size?

Yes. Between-subjects variance decreases with sample size. This is more pronounced for non-contrast measures.

This goes against my intuitions. Looking at the change in the between-subjects variance percentage of individual measures, there seems to be a lot of inter-measure variance (more pronounced below for within-subject variance). I'm not sure whether there is something in common among the measures that show increasing between-subjects variability with sample size that separates them from those that show decreasing between-subjects variability with sample size (the slight majority).

```{r}
rel_df_sample_size_summary %>%
  na.exclude() %>%
  ggplot(aes(sample_size, mean_var_subs_pct))+
  geom_line(aes(group = dv, color=ddm_raw), alpha = 0.1)+
  geom_line(data = tmp, aes(sample_size,mean_var_subs_pct, color=ddm_raw))+
  geom_point(data = tmp, aes(sample_size,mean_var_subs_pct, color=ddm_raw))+
  facet_wrap(~overall_difference)+
  ylab("Mean percentage of \n between subjects variance \n of 100 samples of size n")+
  xlab("Sample size")+
  theme(legend.title = element_blank(),
        legend.position = "bottom")
```

```{r}
summary(lmer(var_subs_pct ~ sample_size * overall_difference + (1|dv) + (1|iteration), rel_df_sample_size))
```

Does within subjects variance change with sample size?

Yes. This again goes against my intuition, but here the inter-measure differences are even more pronounced. There appear to be some measures for which the change between the two measurement time points grows with the number of subjects tested, and others that show a smaller decrease in within-subject variance with larger sample sizes. I still don't know whether these two types of measures have anything that distinguishes them.

```{r}
rel_df_sample_size_summary %>%
  na.exclude() %>%
  ggplot(aes(sample_size, mean_var_ind_pct))+
  geom_line(aes(group = dv, color=ddm_raw), alpha = 0.1)+
  geom_line(data = tmp, aes(sample_size,mean_var_ind_pct, color=ddm_raw))+
  geom_point(data = tmp, aes(sample_size,mean_var_ind_pct, color=ddm_raw))+
  facet_wrap(~overall_difference)+
  ylab("Mean percentage of \n within subjects variance \n of 100 samples of size n")+
  xlab("Sample size")+
  theme(legend.title = element_blank(),
        legend.position = "bottom")
```

```{r}
summary(lmer(var_ind_pct ~ sample_size * overall_difference + (1|dv) + (1|iteration), rel_df_sample_size))
```

Does residual variance change with sample size?

```{r}
rel_df_sample_size_summary %>%
  na.exclude() %>%
  ggplot(aes(sample_size, mean_var_resid_pct))+
  geom_line(aes(group = dv, color=ddm_raw), alpha = 0.1)+
  geom_line(data = tmp, aes(sample_size,mean_var_resid_pct, color=ddm_raw))+
  geom_point(data = tmp, aes(sample_size,mean_var_resid_pct, color=ddm_raw))+
  facet_wrap(~overall_difference)+
  ylab("Mean percentage of residual variance \n of 100 samples of size n")+
  xlab("Sample size")+
  theme(legend.title = element_blank(),
        legend.position = "bottom")
```

```{r}
summary(lmer(var_resid_pct ~ sample_size * overall_difference + (1|dv) + (1|iteration), rel_df_sample_size))
```

*Conclusion:* Larger samples are better for reliability but not necessarily always for the same reasons; for some variables this is due to increasing between subjects variance while for others it's due to decreasing residual variance (?).
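One way to probe the question marked "(?)" above is to fit a per-DV slope of each variance component against sample size and tabulate the sign patterns. This is a sketch assuming `rel_df_sample_size_summary` is still in memory with the columns plotted above.

```r
# Per-DV slopes of between-subjects and residual variance against sample
# size; the sign pattern shows which route to higher reliability each DV takes
library(dplyr)

slope_signs = rel_df_sample_size_summary %>%
  group_by(dv) %>%
  summarise(subs_slope  = coef(lm(mean_var_subs_pct ~ sample_size))[2],
            resid_slope = coef(lm(mean_var_resid_pct ~ sample_size))[2]) %>%
  mutate(route = case_when(
    subs_slope > 0 & resid_slope < 0 ~ "both",
    subs_slope > 0                   ~ "more between-subjects variance",
    resid_slope < 0                  ~ "less residual variance",
    TRUE                             ~ "neither"))

table(slope_signs$route)
```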

```{r}
rm(rel_df_sample_size, rel_df_sample_size_summary)
```

## Hierarchical estimation consequences

The H in HDDM estimates stands for 'hierarchical', denoting the fact that the distribution of the whole sample is incorporated in the priors for the model parameters. This differs from, e.g., EZ-diffusion parameters, which would be the same for a given subject's data regardless of the rest of the sample. One can ask whether this H indeed has a measurable impact.

HDDM and EZ parameter estimates might differ due to other differences in the estimation process as well. Therefore, to evaluate whether the 'H' leads to meaningful changes in parameters, we compare the same models fit on the same data using either the whole sample for the hierarchical structure or only the single subject's data. The latter estimates are referred to as 'flat' estimates.

```{r}
retest_hddm_flat = read.csv(paste0(retest_data_path,'retest_hddm_flat.csv'))

retest_hddm_flat = retest_hddm_flat %>% rename(sub_id = subj_id)

test_hddm_flat = read.csv(paste0(retest_data_path,'/t1_data/t1_hddm_flat.csv'))

test_hddm_flat = test_hddm_flat %>% rename(sub_id = subj_id)

# numeric_cols = get_numeric_cols()

# Check if all the variables are there (no for now)
# sum(names(retest_hddm_flat) %in% numeric_cols) == length(names(retest_hddm_flat))
# names(retest_hddm_flat)[which(names(retest_hddm_flat) %in% numeric_cols == FALSE)]

# sum(names(test_hddm_flat) %in% numeric_cols) == length(names(test_hddm_flat))
# names(test_hddm_flat)[which(names(test_hddm_flat) %in% numeric_cols == FALSE)]
```

### Parameter estimate

Plot the percent change in raw parameters for retest and t1 for each of the 3 parameters (using the hierarchical estimate as baseline: by what percent does the flat parameter differ from the hierarchical one?).

For both time points, the thresholds and non-decision times are very similar regardless of whether they are estimated hierarchically or without the hierarchy (differences peaking at 0).  

This is also true for the majority of the drift rates. There are, however, almost as many drift rates that change completely when estimated without the hierarchy.  

```{r}
#Should not be necessary later
common_cols = names(retest_hddm_flat)[names(retest_hddm_flat) %in% names(retest_data)]
common_cols=common_cols[common_cols %in% names(test_hddm_flat)]

retest_hddm_hier = retest_data %>% select(all_of(common_cols)) %>% mutate(hddm="hierarchical")
test_hddm_hier = test_data %>% select(all_of(common_cols))  %>% mutate(hddm="hierarchical")
retest_hddm_flat = retest_hddm_flat %>% select(all_of(common_cols)) %>% mutate(hddm="flat")
test_hddm_flat = test_hddm_flat %>% select(all_of(common_cols)) %>% mutate(hddm="flat")

retest_flat_difference = rbind(retest_hddm_hier, retest_hddm_flat)
retest_flat_difference = retest_flat_difference %>%
  gather(dv, value, -sub_id, -hddm) %>%
  spread(hddm, value) %>%
  # percent change relative to the hierarchical estimate, clipped to [-100, 100] for plotting
  mutate(diff_pct = (hierarchical - flat)/hierarchical*100,
         diff_pct = ifelse(diff_pct<(-100), -100, ifelse(diff_pct>100, 100, diff_pct)),
         time = "retest",
         par = ifelse(grepl("drift", dv), "drift", ifelse(grepl("thresh", dv), "thresh", ifelse(grepl("non_decision", dv), "non_decision", NA))))

test_flat_difference = rbind(test_hddm_hier, test_hddm_flat)
test_flat_difference = test_flat_difference %>%
  gather(dv, value, -sub_id, -hddm) %>%
  spread(hddm, value) %>%
  mutate(diff_pct = (hierarchical - flat)/hierarchical*100,
         diff_pct = ifelse(diff_pct<(-100), -100, ifelse(diff_pct>100, 100, diff_pct)),
         time = "test",
         par = ifelse(grepl("drift", dv), "drift", ifelse(grepl("thresh", dv), "thresh", ifelse(grepl("non_decision", dv), "non_decision", NA))))

flat_difference = rbind(test_flat_difference, retest_flat_difference)

flat_difference %>%
  ggplot(aes(diff_pct))+
  geom_histogram()+
  facet_grid(factor(time, levels = c("test", "retest"),labels = c("test", "retest")) ~ par, scales="free")
```

Is the average percentage change different than 0? No.

This model uses the distribution of percentage differences in the drift rate estimates as the baseline level. A more appropriate model would test whether all three distributions differ from 0 simultaneously, but the conclusion should not change: the intercept suggests that the mean of the drift rate difference distribution is at 0, and this does not differ from the distributions for the other parameters at either time point.

```{r}
# factor levels of par default to alphabetical order, so drift rate is the baseline level
summary(lmer(diff_pct ~ time*par + (1|dv), flat_difference))
```

Do people change in the same way? Mostly.

Plotting the raw parameter estimate obtained hierarchically against the estimate obtained without the hierarchy. The red lines are 45-degree identity lines. The fact that most points cluster around this line, with no systematic deviations from it for any parameter at either time point, suggests that the hierarchical and flat estimates are mostly the same.

```{r}
flat_difference %>%
  ggplot(aes(hierarchical, flat))+
  geom_point()+
  geom_abline(aes(intercept=0, slope=1), color="red")+
  facet_grid(factor(time, levels = c("test", "retest"),labels = c("test", "retest")) ~ par)
```

### Parameter reliability

Plotting the reliability estimates of the flat parameters against those of the hierarchical estimates. The red lines are 45-degree identity lines. While there are some changes in reliability, none appear large (or consequential in pushing the reliability to an acceptable level for either estimation method) or systematic.

```{r}
rel_df_flat = make_rel_df(t1_df = test_hddm_flat, t2_df = retest_hddm_flat, metrics = c('icc', 'pearson', 'var_breakdown'))

# join in the hierarchical reliabilities; after the join, icc.x is the flat
# estimate's ICC and icc.y is the hierarchical estimate's ICC
rel_df_flat = rel_df_flat %>%
  left_join(rel_df[,c("dv", "icc", "rt_acc", "overall_difference")], by = "dv") 

rel_df_flat %>%
  ggplot(aes(icc.y, icc.x))+
  geom_point()+
  geom_abline(aes(slope=1, intercept = 0), color="red")+
  facet_wrap(~rt_acc)+
  xlab("Hierarchical reliability")+
  ylab("Flat reliability")
```

```{r}
with(rel_df_flat %>% filter(rt_acc == "drift rate"), t.test(icc.x, icc.y, paired=T))
with(rel_df_flat %>% filter(rt_acc == "threshold"), t.test(icc.x, icc.y, paired=T))
with(rel_df_flat %>% filter(rt_acc == "non-decision"), t.test(icc.x, icc.y, paired=T))
```

```{r}
rm(flat_difference, rel_df_flat, retest_flat_difference, retest_hddm_flat, retest_hddm_hier, test_flat_difference,  test_hddm_flat, test_hddm_hier)
```

### Parameter fit

Do the fit statistics differ by whether the model was hierarchical or not? No.

```{r}
t1_hierarchical_fitstats = t1_hierarchical_fitstats %>%
  filter(subj_id %in% retest_hierarchical_fitstats$subj_id)

tmp = rbind(retest_hierarchical_fitstats, retest_flat_fitstats, t1_hierarchical_fitstats, t1_flat_fitstats) %>%
  separate(sample, c("time", "proc"), sep="_") %>%
  select(log_rsq_adj, time, proc, subj_id, task_name) %>%
  spread(proc, log_rsq_adj) 

tmp %>%
  ggplot(aes(hierarchical, flat))+
  geom_point()+
  geom_abline(aes(slope=1, intercept=0), color="red")+
  facet_wrap(~time)
```

```{r}
with(tmp %>% filter(time == "t1"), cor.test(flat, hierarchical))
with(tmp %>% filter(time == "retest"), cor.test(flat, hierarchical))
```


```{r}
with(tmp %>% filter(time == "t1"), t.test(flat, hierarchical, paired=TRUE))
with(tmp %>% filter(time == "retest"), t.test(flat, hierarchical, paired=TRUE))
```

## Clustering

-- Do DDM parameters capture similar processes as the raw measures within a given task, or do they capture processes that are more similar across tasks?
(If the former, they would be less useful than if the latter.)  

This could be analyzed with factor analysis, but there are more variables than observations, so as a first pass we'll explore correlations.    

For this we correlate each DDM measure with:    
1. raw measures within task  
2. other ddm measures across tasks  

```{r eval=FALSE, echo=FALSE}
#Standardize datasets
test_data_std = test_data %>% mutate_if(is.numeric, scale)
test_data_std = test_data_std %>% select(-sub_id)

retest_data_std = retest_data %>% mutate_if(is.numeric, scale)
retest_data_std = retest_data_std %>% select(-sub_id)

#Get correlation table for test data and melt into long form
test_data_cor = data.frame(cor(test_data_std, use="pairwise.complete.obs"))
test_data_cor = test_data_cor %>%
  mutate(dv = row.names(.)) %>%
  gather(key, value, -dv) %>%
  # drop the diagonal and one copy of each symmetric pair of correlations
  filter(value != 1 & duplicated(value)==FALSE)

#Add measure type info to long form correlation table
test_data_cor = test_data_cor %>%
  left_join(rel_df %>% select(dv, task_group, raw_fit, rt_acc, ddm_raw), by = "dv") %>%
  rename(var_1 = dv, dv = key, task_group_1 = task_group, raw_fit_1 = raw_fit, rt_acc_1 = rt_acc, ddm_raw_1 = ddm_raw) %>%
    left_join(rel_df %>% select(dv, task_group, raw_fit, rt_acc, ddm_raw), by = "dv") %>%
  rename(var_2 = dv, task_group_2 = task_group, raw_fit_2 = raw_fit, rt_acc_2 = rt_acc, ddm_raw_2 = ddm_raw) %>%
  #task-task: if the measures are from the same task and one of them is a ddm measure while the other is raw
  #ddm-ddm: if the measures are NOT from the same task and they are both ddm measures
  mutate(task_task = ifelse((task_group_1 == task_group_2) & ((ddm_raw_1 == "ddm" & ddm_raw_2 == "raw") | (ddm_raw_1 == "raw" & ddm_raw_2 == "ddm")), "task-task", ifelse((task_group_1 != task_group_2) & (ddm_raw_1 == "ddm" & ddm_raw_2 == "ddm"), "ddm-ddm",NA)), 
         time="test")

#summarise by relationship between correlated variables for plotting
test_data_cor_med = test_data_cor %>%
  na.exclude() %>%
  group_by(task_task) %>%
  summarise(median_abs_cor = median(abs(value)),
            mean_abs_cor = mean(abs(value)),
            time="test")

#Get same correlation table for retest data
retest_data_cor = data.frame(cor(retest_data_std, use="pairwise.complete.obs"))
retest_data_cor = retest_data_cor %>%
  mutate(dv = row.names(.)) %>%
  gather(key, value, -dv) %>%
  filter(value != 1 & duplicated(value)==FALSE)

retest_data_cor = retest_data_cor %>%
  left_join(rel_df %>% select(dv, task_group, raw_fit, rt_acc, ddm_raw), by = "dv") %>%
  rename(var_1 = dv, dv = key, task_group_1 = task_group, raw_fit_1 = raw_fit, rt_acc_1 = rt_acc, ddm_raw_1 = ddm_raw) %>%
    left_join(rel_df %>% select(dv, task_group, raw_fit, rt_acc, ddm_raw), by = "dv") %>%
  rename(var_2 = dv, task_group_2 = task_group, raw_fit_2 = raw_fit, rt_acc_2 = rt_acc, ddm_raw_2 = ddm_raw) %>%
  mutate(task_task = ifelse((task_group_1 == task_group_2) & ((ddm_raw_1 == "ddm" & ddm_raw_2 == "raw") | (ddm_raw_1 == "raw" & ddm_raw_2 == "ddm")), "task-task", ifelse((task_group_1 != task_group_2) & (ddm_raw_1 == "ddm" & ddm_raw_2 == "ddm"), "ddm-ddm",NA)),
         time="retest")

retest_data_cor_med = retest_data_cor %>%
  na.exclude() %>%
  group_by(task_task) %>%
  summarise(median_abs_cor = median(abs(value)),
            mean_abs_cor = mean(abs(value)), 
            time = "retest")

all_data_cor = rbind(test_data_cor, retest_data_cor)
all_data_cor_med = rbind(test_data_cor_med, retest_data_cor_med)

all_data_cor %>%
  na.exclude() %>%
  ggplot(aes(abs(value)))+
  geom_histogram()+
  geom_vline(data=all_data_cor_med, aes(xintercept=median_abs_cor), color="red", linetype = "dashed")+
  facet_grid(time~task_task)+
  xlab("Absolute correlation")
```

(Absolute) correlations between raw and ddm measures within a task are higher than those between ddm measures across tasks.

```{r  eval=FALSE, echo=FALSE}
all_data_cor_med
```

```{r  eval=FALSE, echo=FALSE}
rm(all_data_cor, all_data_cor_med, retest_data_cor, retest_data_cor_med, test_data_cor, test_data_cor_med)
```

### Factor analysis

```{r}

```
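A minimal sketch of how such a factor analysis might proceed once the variable set is reduced below the number of observations. The one-factor structure, loadings, and data below are simulated assumptions for illustration, not the actual measures.

```{r}
# Hypothetical sketch: one latent factor generating six standardized measures
set.seed(1)
n <- 150                               # retest sample size
lambda <- c(.8, .7, .6, .5, .4, .3)    # assumed factor loadings
f <- rnorm(n)                          # latent factor scores
x <- sapply(lambda, function(l) l * f + rnorm(n, sd = sqrt(1 - l^2)))

# maximum-likelihood factor analysis from base R's stats package
fit <- factanal(x, factors = 1, scores = "regression")
print(fit$loadings)
```

With the real data, `x` would be a subset of the standardized raw and DDM measures, and the number of factors would be chosen by fit comparison rather than assumed.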

-- Are these clusters more reliable than using either the raw or the DDM measures alone?

## Prediction

-- Do raw or DDM measures (or factor scores) predict real world outcomes better?

## To Do

- Alternate DDMs  
- Time and accuracy costs vs DDMs (Hedge et al., 2018, Psych Bull)  
- Trial by trial variability as individual difference measures (Rouder and Haaf, PsyArXiv)  